
    Efficient feature reduction and classification methods

    The sheer volume of data today and its expected growth over the next years are among the key challenges in data mining and knowledge discovery applications. Besides the huge number of data samples that are collected and processed, the high-dimensional nature of data arising in many applications calls for effective and efficient techniques that can deal with this massive amount of data. In addition to the significant increase in the demand for computational resources, such large datasets can also degrade the quality of several data mining applications (especially if the number of features is very high compared to the number of samples). As the dimensionality of data increases, many types of data analysis and classification problems become significantly harder, which causes problems for both supervised and unsupervised learning. Dimensionality reduction and feature (subset) selection are two types of techniques for reducing the attribute space. While feature selection extracts a subset of the original attributes, dimensionality reduction in general produces linear combinations of the original attribute set. In both approaches, the goal is to select a low-dimensional subset of the attribute space that covers most of the information in the original data. In recent years, feature selection and dimensionality reduction techniques have become a real prerequisite for data mining applications. There are several open questions in this research field, and due to the ever increasing number of candidate features in various application areas (e.g., email filtering or drug classification/molecular modeling) new questions arise. In this thesis, we focus on some open research questions in this context, such as the relationship between feature reduction techniques and the resulting classification accuracy, and the relationship between the variability captured in the linear combinations produced by dimensionality reduction techniques (e.g., PCA, SVD) and the accuracy of machine learning algorithms operating on them.
    Another important goal is to better understand new techniques for dimensionality reduction, such as nonnegative matrix factorization (NMF), which can be applied for finding parts-based, linear representations of nonnegative data. This "sum-of-parts" representation is especially useful if the interpretability of the original data should be retained. Moreover, performance aspects of feature reduction algorithms are investigated. As data grow, implementations of feature selection and dimensionality reduction techniques for high-performance parallel and distributed computing environments become more and more important. In this thesis, we focus on two types of open research questions: methodological advances without any specific application context, and application-driven advances for a specific application context. In summary, the new methodological contributions are the following. The utilization of nonnegative matrix factorization in the context of classification methods is investigated; in particular, it is of interest how the improved interpretability of NMF factors due to the nonnegativity constraints (which is of central importance in various problem settings) can be exploited. Motivated by this problem context, two new fast initialization techniques for NMF based on feature selection are introduced. It is shown how approximation accuracy can be increased and/or computational effort reduced compared to standard randomized seeding of the NMF and to state-of-the-art initialization strategies suggested earlier. For example, for a given number of iterations and a required approximation error, a speedup of 3.6 compared to standard initialization and of 3.4 compared to state-of-the-art initialization strategies could be achieved. Beyond that, novel classification methods based on NMF are proposed and investigated. We can show that they are not only competitive in terms of classification accuracy with state-of-the-art classifiers, but also provide important advantages in terms of computational effort (especially for low-rank approximations). Moreover, parallelization and distributed execution of NMF is investigated: several algorithmic variants for efficiently computing NMF on multi-core systems are studied and compared to each other, in particular several approaches for exploiting task and/or data parallelism in NMF. We show that for some scenarios the new algorithmic variants clearly outperform existing implementations. Last but not least, a computationally very efficient adaptation of the implementation of the ALS algorithm in Matlab 2009a is investigated. This variant reduces the runtime significantly (in some settings by a factor of 8) and also offers several possibilities for concurrent execution. In addition to purely methodological questions, we also address questions arising in the adaptation of feature selection and classification methods to two specific application problems: email classification and in silico screening for drug discovery. Different research challenges arise in these application areas, such as the dynamic nature of the data in email classification, or the imbalance in the number of available samples per class in drug discovery. Application-driven advances of this thesis comprise the adaptation and application of latent semantic indexing (LSI) to the task of email filtering.
    Experimental results show that LSI achieves significantly better classification results than the widespread de facto standard method for this application context. In the context of drug discovery problems, several groups of well-discriminating descriptors could be identified by utilizing the "sum-of-parts" representation of NMF. The number of important descriptors could be further increased by applying sparseness constraints to the NMF factors.
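    Since the thesis code itself is not reproduced in this abstract, the following is only a minimal NumPy sketch of the two ideas just summarized, with all names illustrative: NMF is seeded from the data columns with the largest norms (a simple stand-in for the feature-selection-based initializations developed in the thesis), and a sample is classified by comparing the reconstruction errors of per-class factorizations.

```python
import numpy as np

def nmf(A, k, iters=200, seed=0):
    """Minimal NMF via multiplicative updates: A ~ W @ H, with A >= 0."""
    rng = np.random.default_rng(seed)
    # Seed W with the k columns of A of largest norm; a crude stand-in
    # for the feature-selection-based initializations described above.
    idx = np.argsort(-np.linalg.norm(A, axis=0))[:k]
    W = A[:, idx] + 1e-9
    H = rng.random((k, A.shape[1]))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)
        W *= (A @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

def classify(x, class_bases):
    """Assign x to the class whose NMF basis W reconstructs it best."""
    def err(W):
        # Projected least squares keeps the sketch short; a proper
        # nonnegative least-squares solve would be more faithful.
        h = np.maximum(np.linalg.lstsq(W, x, rcond=None)[0], 0)
        return np.linalg.norm(x - W @ h)
    return min(class_bases, key=lambda c: err(class_bases[c]))
```

    Seeding W with informative columns rather than random values is what lets the approximation error drop faster in the first iterations; the thesis's actual initialization and classification schemes may differ in detail.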

    libNMF -- A Library for Nonnegative Matrix Factorization

    We present libNMF -- a computationally efficient, high-performance library for computing nonnegative matrix factorizations (NMF), written in C. Various algorithms and algorithmic variants for computing NMF are supported. libNMF is based on external routines from BLAS (Basic Linear Algebra Subprograms), LAPACK (Linear Algebra PACKage), and ARPACK, which provide efficient building blocks for performing central vector and matrix operations. Since modern BLAS implementations support multi-threading, libNMF can exploit the potential of multi-core architectures. In this paper, the basic NMF algorithms contained in libNMF and existing implementations found in the literature are briefly reviewed. Then, libNMF is evaluated in terms of computational efficiency and numerical accuracy and compared with the best existing codes available. libNMF is publicly available at http://rlcta.univie.ac.at/software
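    The abstract does not document libNMF's API, so the sketch below (in NumPy, not C) shows only the underlying numerical pattern of an alternating least squares (ALS) NMF iteration: each factor is obtained from an unconstrained least-squares solve followed by projection onto the nonnegative orthant. These dense solves are exactly the operations that map onto multi-threaded BLAS/LAPACK routines.

```python
import numpy as np

def nmf_als(A, k, iters=100, seed=0):
    """Schematic ALS iteration for A ~ W @ H; not libNMF's interface."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    H = None
    for _ in range(iters):
        # Least-squares solve for H, then clip negatives to zero.
        H = np.maximum(np.linalg.lstsq(W, A, rcond=None)[0], 0)
        # Least-squares solve for W (via the transposed system), clipped.
        W = np.maximum(np.linalg.lstsq(H.T, A.T, rcond=None)[0].T, 0)
    return W, H
```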

    First results from the AugerPrime Radio Detector


    Update of the Offline Framework for AugerPrime


    Combined fit to the spectrum and composition data measured by the Pierre Auger Observatory including magnetic horizon effects

    The measurements by the Pierre Auger Observatory of the energy spectrum and mass composition of cosmic rays can be interpreted assuming the presence of two extragalactic source populations, one dominating the flux at energies above a few EeV and the other below. To fit the data ignoring magnetic-field effects, the high-energy population needs to accelerate a mixture of nuclei with very hard spectra, at odds with the approximate E^{-2} shape expected from diffusive shock acceleration. The presence of turbulent extragalactic magnetic fields in the region between the closest sources and the Earth can significantly modify the observed cosmic-ray spectrum with respect to that emitted by the sources, reducing the flux of low-rigidity particles that reach the Earth. Here we take this magnetic horizon effect into account in the combined fit of the spectrum and shower-depth distributions, exploring the possibility that a source spectrum for the high-energy population with a shape closer to E^{-2} might be able to explain the observations.
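    Schematically, and in our own notation rather than the exact parameterization of the fit, each source population injects a power law with a rigidity-dependent cutoff, and the magnetic horizon enters as a charge- and energy-dependent suppression of the flux arriving from the nearest sources:

```latex
% Illustrative form (not the exact fit parameterization):
J_Z(E) \propto E^{-\gamma}\,
  f_{\mathrm{cut}}\!\left(\frac{E}{Z\,R_{\mathrm{cut}}}\right)\,
  G\!\left(\frac{E}{Z}\right)
% f_cut suppresses the injected flux above the rigidity cutoff R_cut;
% G(E/Z) <= 1 models the magnetic-horizon suppression of low-rigidity
% particles and approaches 1 when propagation is effectively unimpeded.
```

    Because the suppression G removes low-rigidity particles during propagation, it relaxes the need for intrinsically hard injection spectra, which is what allows an injection index closer to 2 to become compatible with the data.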

    A search for ultra-high-energy photons at the Pierre Auger Observatory exploiting air-shower universality

    The Pierre Auger Observatory is the most sensitive detector to primary photons with energies above ∼0.2 EeV. It measures extensive air showers using a hybrid technique that combines a fluorescence detector (FD) with a ground array of particle detectors (SD). The signatures of a photon-induced air shower are a larger atmospheric depth of the shower maximum (Xmax) and a steeper lateral distribution function, along with a lower number of muons, with respect to the bulk of hadron-induced background. Using observables measured by the FD and SD, three photon searches in different energy bands are performed. In particular, between threshold energies of 1 and 10 EeV, a new analysis technique has been developed by combining the FD-based measurement of Xmax with the SD signal through a parameter related to its muon content, derived from the universality of air showers. This technique has led to a better photon/hadron separation and, consequently, to a higher search sensitivity, resulting in a tighter upper limit than before. The outcome of this new analysis is presented here, along with previous results in the energy ranges below 1 EeV and above 10 EeV. From the data collected by the Pierre Auger Observatory in about 15 years of operation, the most stringent constraints on the fraction of photons in the cosmic flux are set over almost three decades in energy.
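    The abstract does not spell out how the two observables are combined into a single separation parameter; a Fisher linear discriminant is one standard way to merge Xmax and a muon-content estimate, sketched here with illustrative inputs and names (not the analysis code of the Observatory).

```python
import numpy as np

def fisher_axis(photon_obs, hadron_obs):
    """Fisher discriminant axis for (n_events, 2) arrays whose columns
    could be, e.g., Xmax and a muon-content parameter."""
    mu_p, mu_h = photon_obs.mean(axis=0), hadron_obs.mean(axis=0)
    # Within-class scatter; its inverse applied to the mean difference
    # gives the projection axis that best separates the two classes.
    S_w = np.cov(photon_obs, rowvar=False) + np.cov(hadron_obs, rowvar=False)
    w = np.linalg.solve(S_w, mu_p - mu_h)
    return w / np.linalg.norm(w)

# Photon candidates would then be selected by cutting on the projected
# score events @ w, with the cut tuned on simulated showers.
```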

    Study on multi-ELVES in the Pierre Auger Observatory

    Since 2013, the four sites of the Fluorescence Detector (FD) of the Pierre Auger Observatory have recorded ELVES with a dedicated trigger. These UV light emissions are correlated with distant lightning strikes. The length of the recorded traces has been increased from 100 μs (2013) to 300 μs (2014-16) to 900 μs (2017-present), to progressively extend the observation of the light emission towards the vertical of the causative lightning and beyond. A large fraction of the observed events shows double ELVES within the time window, and, in some cases, even more complex structures are observed. The nature of multi-ELVES is not completely understood but may be related to the different types of lightning in which they originate. For example, it is known that Narrow Bipolar Events can produce double ELVES, and Energetic In-cloud Pulses, occurring between the main negative and upper positive charge layers of clouds, can induce double and even quadruple ELVES in the ionosphere. This report shows the seasonal and daily dependence of the time gap, amplitude ratio, and correlation between the pulse widths of the peaks in a sample of more than 1000 multi-ELVES events recorded during the period 2014-20. The events have been compared with data from other satellite and ground-based sensing devices to study the correlation of their properties with lightning observables such as altitude and polarity.

    Studies of the mass composition of cosmic rays and proton-proton interaction cross-sections at ultra-high energies with the Pierre Auger Observatory

    In this work, we present an estimate of the cosmic-ray mass composition from the distributions of the depth of the shower maximum (Xmax) measured by the fluorescence detector of the Pierre Auger Observatory. We discuss the sensitivity of the mass-composition measurements to the uncertainties in the properties of hadronic interactions, particularly in the predictions of the particle interaction cross-sections. For this purpose, we adjust the fractions of cosmic-ray mass groups to fit the data with Xmax distributions from air-shower simulations. We modify the proton-proton cross-sections at ultra-high energies, and the corresponding air-shower simulations with rescaled nucleus-air cross-sections are obtained via Glauber theory. We compare the energy-dependent composition of ultra-high-energy cosmic rays obtained for the different extrapolations of the proton-proton cross-sections from low-energy accelerator data.
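    As an illustration of what modifying the proton-proton cross-sections can look like (our notation; the abstract does not give the exact form), one common ansatz rescales sigma_pp by a factor that grows logarithmically above the highest energy constrained by accelerator data, after which the nucleus-air cross-sections follow via Glauber theory:

```latex
% Illustrative rescaling ansatz (assumed form, our notation):
\sigma_{pp}^{\mathrm{mod}}(E) = \sigma_{pp}(E)\left[1 + (f_{19}-1)\,
  \frac{\log_{10}\!\left(E/E_{\mathrm{thr}}\right)}
       {\log_{10}\!\left(10^{19}\,\mathrm{eV}/E_{\mathrm{thr}}\right)}\right],
  \qquad E > E_{\mathrm{thr}}
% f_19 sets the rescaling factor at 10^19 eV; E_thr is the highest
% energy at which accelerator data constrain sigma_pp.
```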